Problem statement: Credit Card Customer Segmentation
Background:
Objective: To identify different segments in the existing customer base, using their spending patterns as well as their past interactions with the bank.
Key Questions:
Data Description:
Customer key - Identifier for the customer
Average Credit Limit - Average credit limit across all the credit cards
Total credit cards - Total number of credit cards
Total visits bank - Total number of bank visits
Total visits online - Total number of online visits
Total calls made - Total number of calls made by the customer
Deliverable:
Perform univariate analysis on the data to better understand the variables at your disposal and to get an idea of the number of clusters. Perform EDA and create visualizations to explore the data. (10 marks)
Comment the code properly, explain the steps taken in the notebook, and conclude with your insights from the graphs. (5 marks)
Execute K-means clustering, use an elbow plot, and analyse the clusters using boxplots. (10 marks)
Execute hierarchical clustering (with different linkages) with the help of dendrograms and the cophenetic coefficient. Analyse the clusters formed using boxplots. (15 marks)
Calculate the average silhouette score for both methods. (5 marks)
Compare the K-means clusters with the hierarchical clusters. (5 marks)
Analyse the clusters formed, explain how one cluster differs from another, and answer all the key questions. (10 marks)
Deliverable – 1 and 2: Univariate and Bivariate Analysis, EDA and insights
Import all the required libraries
import os, sys, re
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
# For Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Add nice background to the graphs
sns.set(color_codes=True)
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
# scipy and sklearn libraries
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import fcluster
import warnings
warnings.filterwarnings('ignore')
Read the input file into a pandas dataframe
# NOTE: Package required for the Excel read: pip install "xlrd>=1.2.0" (xlrd 2.x dropped .xlsx support; with recent pandas, install openpyxl instead)
cc_customer_df = pd.read_excel('CreditCardCustomerData.xlsx')
cc_customer_df.head(10)
# Shape of the dataframe
cc_customer_df.shape
# Data types of each column
cc_customer_df.dtypes
Check for missing values using info() and re-validate them using isnull(). Compute the descriptive statistics (min, max, mean, median, standard deviation and quartiles) of every column using describe().
# Check the detailed view of the data, see if the data has any null values, and re-verify the data types
cc_customer_df.info()
# Cross-check the non-null counts reported by info()
cc_customer_df.isnull().sum()
# Analyse the statistical summary and the distribution of the various attributes
cc_customer_df.describe().transpose()
# Let's find all the duplicate rows in the data frame
cc_customer_df[cc_customer_df.duplicated(keep=False)]
**Insights:**
NOTE: No duplicate records were found while "Sl_No" and "Customer Key" were still part of the data.
# Let's drop the "Sl_No" and "Customer_Key" columns as they don't give meaningful info
cc_customer_df = cc_customer_df.iloc[:, 2:]
cc_customer_df.head(10)
# Check the unique values in each column of the dataframe.
cc_customer_df.nunique()
# Let's see if we get duplicate rows after dropping the columns (Sl_No and Customer Key) from the data frame
cc_customer_df[cc_customer_df.duplicated(keep=False)]
# Let's drop all the duplicate rows from the dataframe
cc_customer_df.drop_duplicates(inplace=True, keep="first")
# Shape of the dataframe
cc_customer_df.shape
**Insights:**
After dropping the columns (Sl_No and Customer Key) and checking the shape again, we can see that there were 11 duplicate records in the data. The duplicate rows were dropped for better estimates. The dataset now has 649 rows and 5 columns.
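Why duplicates only show up once the identifier columns are gone can be illustrated on a small toy frame. This is a minimal sketch: the values below are made up for illustration and are not taken from CreditCardCustomerData.xlsx.

```python
import pandas as pd

# Toy data (hypothetical values, not from the real dataset)
toy_df = pd.DataFrame({
    "Customer_Key": [101, 102, 103, 104],
    "Avg_Credit_Limit": [50000, 50000, 30000, 50000],
    "Total_Credit_Cards": [3, 3, 5, 3],
})

# With the identifier present, every row is unique
dup_with_id = toy_df.duplicated(keep=False).sum()

# Without the identifier, rows 0, 1 and 3 share the same feature values
features = toy_df.drop(columns=["Customer_Key"])
dup_without_id = features.duplicated(keep=False).sum()

# keep="first" retains one representative of each duplicated group
deduped = features.drop_duplicates(keep="first")

print(dup_with_id, dup_without_id, deduped.shape)
```

Note that duplicated(keep=False) flags every member of a duplicated group, while drop_duplicates(keep="first") removes all but one of them, so the two counts are not expected to match.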
# Let's visualize the individual variables to see how they are distributed. This helps in further decision making about the data
cc_customer_df[cc_customer_df.columns].hist(stacked=False, bins=50, figsize=(20,20), layout=(5,1));
# Let's plot the density (KDE) of each variable
for i in cc_customer_df.columns:
    # kdeplot draws the same curve as the deprecated sns.distplot(..., hist=False)
    sns.kdeplot(cc_customer_df[i])
    plt.show()
# Let's implement a value-count helper function
def value_count(pd_df=None):
    """Print the 10 most frequent values (as proportions) of every column."""
    columns = pd_df.columns
    for col in columns:
        print('value_counts for {}'.format(col))
        print(pd_df[col].value_counts(normalize=True).head(10))
        print()
# Let's print the value counts of all the variables.
value_count(cc_customer_df)
# Let's see the mean of the other attributes for each value of Total_Credit_Cards
cc_customer_df.groupby('Total_Credit_Cards').mean()
# Let's plot pair plots to see the relation between variables
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(cc_customer_df, diag_kind='kde')
plt.show()
# Let's plot the box plot between Total_Credit_Cards and Avg_Credit_Limit
# Set the plot window size
plt.figure(figsize=(20,5))
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.boxplot(x=cc_customer_df['Total_Credit_Cards'], y=cc_customer_df['Avg_Credit_Limit']);
plt.show()
# Let's plot the box plot between Total_Credit_Cards and Total_visits_online
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(x=cc_customer_df['Total_Credit_Cards'], y=cc_customer_df['Total_visits_online']);
plt.show()
# Let's plot the box plot between Total_Credit_Cards and Total_visits_bank.
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(x=cc_customer_df['Total_Credit_Cards'], y=cc_customer_df['Total_visits_bank']);
plt.show()
# Let's plot the box plot between Total_Credit_Cards and Total_calls_made
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(x=cc_customer_df['Total_Credit_Cards'], y=cc_customer_df['Total_calls_made']);
plt.show()
**Insights:**
The following observations are made from the univariate and bivariate plots and from the value_counts() and groupby() outputs:
Customers with Avg_Credit_Limit < 25000 form the largest group, and they hold <= 3 credit cards. A significant number of customers have Avg_Credit_Limit < 75000.
Customers with 25000 <= Avg_Credit_Limit <= 75000 hold 4 to 7 cards.
Customers with Avg_Credit_Limit > 100000 hold more than 7 credit cards.
Customers with 3-4 credit cards were contacted by call (> 6 times) more than through other methods.
Customers who hold 4-7 credit cards prefer going to the bank for credit cards over other methods, and made 2-4 visits to the bank on average.
Customers who hold more than 7 credit cards prefer online visits for credit cards over other methods, and visited online 6-14 times.
There are a few outliers among the customers who hold 7 credit cards.
Based on the pair plots and density plots, 4 or 5 clusters are possible.
# NOTE: The dataset columns are on different scales, which may influence the results, so let's scale them
cc_customer_df_z = cc_customer_df.apply(zscore)
# Let's see the correlation of the scaled data
# Set the plot window size
plt.figure(figsize=(15, 4))
cc_corr = cc_customer_df_z.corr()
sns.heatmap(cc_corr, annot = True)
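What the z-score scaling does to each column can be checked on a tiny example. This is a sketch on made-up numbers (the toy_df values are assumptions, not real customer data); it shows that scipy's zscore subtracts the column mean and divides by the population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy columns on very different scales (hypothetical values)
toy_df = pd.DataFrame({
    "Avg_Credit_Limit": [10000.0, 20000.0, 50000.0, 100000.0],
    "Total_Credit_Cards": [2.0, 4.0, 6.0, 8.0],
})

# Same call pattern as in the notebook: scale every column independently
toy_z = toy_df.apply(zscore)

# Manual equivalent: scipy's zscore defaults to the population std (ddof=0)
manual_z = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

print(toy_z.mean().round(6))
print(toy_z.std(ddof=0).round(6))
```

After scaling, every column has mean 0 and unit standard deviation, so no single attribute dominates the Euclidean distances used by the clustering steps.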
Deliverable – 3: K-means clustering using elbow plot and analyse clusters using boxplot
# Let's find the optimal number of clusters with metric = Euclidean distance
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(cc_customer_df_z)
    prediction=model.predict(cc_customer_df_z)
    # Average distance of every sample to its nearest cluster centre
    meanDistortions.append(sum(np.min(cdist(cc_customer_df_z, model.cluster_centers_, metric='euclidean'), axis=1))/ cc_customer_df_z.shape[0])
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method using euclidean metric')
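The elbow computation can be tried in isolation on synthetic data where the number of groups is known. The sketch below assumes three well-separated blobs generated with make_blobs and a hypothetical helper mean_distortion; neither is part of the original notebook.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=[[-6, -6], [0, 0], [6, 6]],
                  cluster_std=0.8, random_state=0)

def mean_distortion(X, k, seed=0):
    """Average Euclidean distance from each point to its nearest centroid."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    nearest = np.min(cdist(X, model.cluster_centers_, metric="euclidean"), axis=1)
    return nearest.mean()

distortions = {k: mean_distortion(X, k) for k in range(1, 7)}
for k, d in distortions.items():
    print(k, round(d, 3))
```

On data like this the distortion falls sharply up to k=3 and only marginally afterwards, which is the bend the elbow plot is meant to reveal. Fixing random_state makes the curve reproducible between runs.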
Let's compute KMeans for n_clusters=k=3
"""
Fit and predict K-means using the given k (n_clusters).
Store the results in a new column "GROUP" and print them.
"""
model_3 = KMeans(3)
model_3.fit(cc_customer_df_z)
prediction=model_3.predict(cc_customer_df_z)
#Append the prediction
cc_customer_df["GROUP"] = prediction
cc_customer_df_z["GROUP"] = prediction
print("Groups Assigned : \n")
print(cc_customer_df.head())
# Use the groupby method to build a grouped df, then compute and print the per-group means
cc_customer_df_clusters_3 = cc_customer_df.groupby(['GROUP'])
cc_customer_df_clusters_3.mean()
model_3.cluster_centers_
model_3.labels_
# The silhouette_score gives the average value over all the samples.
# This gives a perspective into the density and separation of the formed clusters.
# Score the feature columns only: "GROUP" holds the labels and must not be used as a feature
silhouette_avg_3 = silhouette_score(cc_customer_df_z.drop('GROUP', axis=1), model_3.labels_)
print(silhouette_avg_3)
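A label column left in the matrix passed to silhouette_score is treated as one more feature. The sketch below, on synthetic blobs (an assumption for illustration), compares the score on the features alone with the score obtained when a "GROUP" column is included.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer data
X, _ = make_blobs(n_samples=200, centers=[[-5, 0], [0, 5], [5, 0]],
                  cluster_std=1.0, random_state=1)
toy_z = pd.DataFrame(X, columns=["f1", "f2"])

model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(toy_z)
toy_z["GROUP"] = model.labels_

# Score on the real features only
feature_cols = [c for c in toy_z.columns if c != "GROUP"]
score_features_only = silhouette_score(toy_z[feature_cols], toy_z["GROUP"])

# Score with the label column mistakenly included as a feature
score_with_label = silhouette_score(toy_z, toy_z["GROUP"])

print(round(score_features_only, 4), round(score_with_label, 4))
```

Including the labels pushes points of different clusters further apart without changing distances inside a cluster, so the second score is inflated. The features-only value is the one that describes the clustering.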
# Let's plot the box plots for the scaled df using the GROUP column.
# cc_customer_df_z.boxplot(by='GROUP', layout=(2,4), figsize=(15,10)) # Somehow this gives an error on my computer, so using an alternate way.
df_columns = cc_customer_df_z.columns[:-1]
# print(df_columns)
for col in df_columns:
    # cc_customer_df_z.boxplot(by='GROUP', column=col, figsize=(5,5))
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z['GROUP'], y=cc_customer_df_z[col])
    plt.show()
Let's compute KMeans for n_clusters=k=5
# Let's reset the dataframes, because the GROUP column added above would otherwise affect the following steps
cc_customer_df = cc_customer_df.drop('GROUP', axis=1)
print(cc_customer_df.head())
cc_customer_df_z = cc_customer_df_z.drop('GROUP', axis=1)
print(cc_customer_df_z.head())
# Let's compute K-means for k=5
"""
Fit and predict K-means using the given k (n_clusters).
Store the results in a new column "GROUP" and print them.
"""
model_5 = KMeans(5)
model_5.fit(cc_customer_df_z)
prediction=model_5.predict(cc_customer_df_z)
#Append the prediction
cc_customer_df["GROUP"] = prediction
cc_customer_df_z["GROUP"] = prediction
print("Groups Assigned : \n")
print(cc_customer_df.head())
# Use the groupby method to build a grouped df, then compute and print the per-group means
cc_customer_df_clusters_5 = cc_customer_df.groupby(['GROUP'])
cc_customer_df_clusters_5.mean()
model_5.cluster_centers_
model_5.labels_
# The silhouette_score gives the average value over all the samples.
# This gives a perspective into the density and separation of the formed clusters.
# Score the feature columns only: "GROUP" holds the labels and must not be used as a feature
silhouette_avg_5 = silhouette_score(cc_customer_df_z.drop('GROUP', axis=1), model_5.labels_)
print(silhouette_avg_5)
# Let's plot the box plots for the scaled df using the GROUP column.
# cc_customer_df_z.boxplot(by='GROUP', layout=(2,4), figsize=(15,10)) # Somehow this gives an error on my computer, so using an alternate way.
df_columns = cc_customer_df_z.columns[:-1]
# print(df_columns)
for col in df_columns:
    # cc_customer_df_z.boxplot(by='GROUP', column=col, figsize=(5,5))
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z['GROUP'], y=cc_customer_df_z[col])
    plt.show()
Let's compute KMeans for n_clusters=k=4
# Let's reset the dataframes, because the GROUP column added above would otherwise affect the following steps
cc_customer_df = cc_customer_df.drop('GROUP', axis=1)
print(cc_customer_df.head())
cc_customer_df_z = cc_customer_df_z.drop('GROUP', axis=1)
print(cc_customer_df_z.head())
# Let's compute K-means for k=4
"""
Fit and predict K-means using the given k (n_clusters).
Store the results in a new column "GROUP" and print them.
"""
model_4 = KMeans(4)
model_4.fit(cc_customer_df_z)
prediction=model_4.predict(cc_customer_df_z)
#Append the prediction
cc_customer_df["GROUP"] = prediction
cc_customer_df_z["GROUP"] = prediction
print("Groups Assigned : \n")
print(cc_customer_df.head())
# Use the groupby method to build a grouped df, then compute and print the per-group means
cc_customer_df_clusters_4 = cc_customer_df.groupby(['GROUP'])
cc_customer_df_clusters_4.mean()
model_4.cluster_centers_
model_4.labels_
# The silhouette_score gives the average value over all the samples.
# This gives a perspective into the density and separation of the formed clusters.
# Score the feature columns only: "GROUP" holds the labels and must not be used as a feature
silhouette_avg_4 = silhouette_score(cc_customer_df_z.drop('GROUP', axis=1), model_4.labels_)
print(silhouette_avg_4)
# Let's plot the box plots for the scaled df using the GROUP column.
# cc_customer_df_z.boxplot(by='GROUP', layout=(2,4), figsize=(15,10)) # Somehow this gives an error on my computer, so using an alternate way.
df_columns = cc_customer_df_z.columns[:-1]
# print(df_columns)
for col in df_columns:
    # cc_customer_df_z.boxplot(by='GROUP', column=col, figsize=(5,5))
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z['GROUP'], y=cc_customer_df_z[col])
    plt.show()
** Final K-means Summary **
n_clusters=k=4
silhouette_score = silhouette_avg_4 = 0.5942080478916576 (as printed by the original run, in which the scaled dataframe passed to silhouette_score still contained the GROUP column)
The data is distributed into 4 groups.
The 1st group (0) customers have an average credit limit of 31,832, hold 5 credit cards and prefer bank visits.
The 2nd group (1) customers have an average credit limit of 12,233, hold around 2 credit cards and prefer calls followed by online visits.
The 3rd group (2) customers have an average credit limit of 141,040, hold around 9 credit cards and prefer online visits over the other communication methods.
The 4th group (3) customers have an average credit limit of 35,857, hold around 6 credit cards and prefer bank visits and calls.
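The three K-means runs above repeat the same fit, label and summarise steps, and each needs the GROUP column dropped before the next run. One possible way to avoid the resets is sketched below, assuming a hypothetical helper summarise_kmeans (not part of the original notebook) and synthetic stand-in data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def summarise_kmeans(raw_df, scaled_df, k, seed=0):
    """Fit K-means on scaled_df; return (labels, per-group means of raw_df, silhouette).

    Hypothetical helper: neither input frame is modified, so no reset is needed between runs.
    """
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled_df)
    labels = pd.Series(model.labels_, index=raw_df.index, name="GROUP")
    group_means = raw_df.groupby(labels).mean()
    score = silhouette_score(scaled_df, labels)
    return labels, group_means, score

# Synthetic stand-in for the customer data (assumption for illustration)
X, _ = make_blobs(n_samples=150, centers=4, n_features=3, random_state=42)
raw = pd.DataFrame(X, columns=["a", "b", "c"])
scaled = (raw - raw.mean()) / raw.std(ddof=0)

results = {k: summarise_kmeans(raw, scaled, k) for k in (3, 4, 5)}
for k, (_, means, score) in results.items():
    print(k, means.shape, round(score, 3))
```

Because the labels are returned as a separate Series, the scaled frame never gains a GROUP column, and the silhouette score is always computed on the features alone.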
Deliverable – 4: Hierarchical clustering (with different linkages) with the help of dendrogram and cophenetic coeff. Analyse clusters formed using boxplot
NOTE: The cophenetic index is a measure of the correlation between the distance of points in feature space and their distance on the dendrogram. The closer it is to 1, the more faithfully the dendrogram preserves the original pairwise distances.
# NOTE: Let's copy the K-means result into a new df and drop the "GROUP" column from the scaled dataset cc_customer_df_z
cc_customer_df_z_kmean_G = cc_customer_df_z.copy()
print (cc_customer_df_z_kmean_G.head())
# Let's reset the dataframes, because the GROUP column added above would otherwise affect the following steps
cc_customer_df = cc_customer_df.drop('GROUP', axis=1)
print(cc_customer_df.head())
cc_customer_df_z = cc_customer_df_z.drop('GROUP', axis=1)
print(cc_customer_df_z.head())
cc_customer_df_z_H_C = cc_customer_df_z.copy()
print (cc_customer_df_z_H_C.head())
# Let's compute the linkage for method=Average
Avg_L = linkage(cc_customer_df_z, metric='euclidean', method='average')
c_avg, coph_dists_avg = cophenet(Avg_L , pdist(cc_customer_df_z))
c_avg
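The cophenetic coefficient returned by cophenet is the Pearson correlation between the original pairwise distances and the distances read off the dendrogram. The sketch below checks this on random synthetic points (an assumption for illustration) by recomputing the correlation with np.corrcoef.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))   # synthetic points, not the customer data

Y = pdist(X)                   # condensed pairwise Euclidean distances
Z = linkage(X, method="average", metric="euclidean")
c, coph_dists = cophenet(Z, Y)

# Pearson correlation between original and dendrogram (cophenetic) distances
c_manual = np.corrcoef(Y, coph_dists)[0, 1]
print(round(float(c), 4), round(float(c_manual), 4))
```

Both printed values agree, which is why a coefficient close to 1 is read as the tree reproducing the original distances well.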
# Let's plot dendrogram for Average Linkage
plt.figure(figsize=(20, 5))
plt.title('Average Linkage method Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Avg_L, p=10, leaf_rotation=90., color_threshold=40,leaf_font_size=8.,truncate_mode='level')
plt.tight_layout()
max_d=5
clusters_avg = fcluster(Avg_L, max_d, criterion='distance')
#clusters_avg
# NOTE: Tried max_d values of 3, 4, 5 and 6; max_d=5 gives the best score
silhouette_avg_avg = silhouette_score(cc_customer_df_z, clusters_avg)
print(silhouette_avg_avg)
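How the max_d threshold controls the number of flat clusters can be seen on synthetic data. The sketch below assumes three well-separated blobs; it also shows the alternative criterion="maxclust", which asks fcluster for a cluster count directly instead of a cut height.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (assumption for illustration)
X, _ = make_blobs(n_samples=120, centers=[[-8, 0], [0, 8], [8, 0]],
                  cluster_std=0.7, random_state=3)
Z = linkage(X, method="average", metric="euclidean")

# Cut by height: the cluster count depends on the chosen threshold
counts = {}
for max_d in (2, 5, 20):
    labels = fcluster(Z, max_d, criterion="distance")
    counts[max_d] = len(np.unique(labels))
    print(max_d, counts[max_d])

# Cut by count: request at most 3 flat clusters
labels_3 = fcluster(Z, 3, criterion="maxclust")
print(len(np.unique(labels_3)))
```

A low threshold splits the tree into many small clusters and a high one merges everything, so trying several max_d values, as done in the notebook, amounts to scanning over cluster counts.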
# Let's add the clusters to the scaled data to analyse the box plots
cc_customer_df_z_H_C["CLUSTERS"] = clusters_avg
# print (cc_customer_df_z)
# Let's plot the box plots for the scaled df using the CLUSTERS column.
# cc_customer_df_z_H_C.boxplot(by='CLUSTERS', layout=(2,4), figsize=(15,10)) # Somehow this gives an error on my computer, so using an alternate way.
df_columns = cc_customer_df_z_H_C.columns[:-1]
# print(df_columns)
for col in df_columns:
    # cc_customer_df_z_H_C.boxplot(by='CLUSTERS', column=col, figsize=(5,5))
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z_H_C['CLUSTERS'], y=cc_customer_df_z_H_C[col])
    plt.show()
# Let's reset the dataframe, because the CLUSTERS column added above would otherwise affect the following steps
cc_customer_df_z_H_C = cc_customer_df_z_H_C.drop('CLUSTERS', axis=1)
print(cc_customer_df_z_H_C.head())
# Let's compute the linkage for method=Complete
Cmplt_L = linkage(cc_customer_df_z, metric='euclidean', method='complete')
c_cmplt, coph_dists_cmplt = cophenet(Cmplt_L , pdist(cc_customer_df_z))
c_cmplt
# Let's plot dendrogram for Complete Linkage
plt.figure(figsize=(20, 5))
plt.title('Complete Linkage method Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Cmplt_L, p=10, leaf_rotation=90.,color_threshold=40,leaf_font_size=8.,truncate_mode='level')
plt.tight_layout()
max_d=8
clusters_cmplt = fcluster(Cmplt_L, max_d, criterion='distance')
# clusters_cmplt
# NOTE: Tried max_d values of 4, 5, 6, 7 and 8; max_d=7 or 8 gives the best score
silhouette_avg_cmplt = silhouette_score(cc_customer_df_z, clusters_cmplt)
print(silhouette_avg_cmplt)
# Let's add the clusters to the scaled data to analyse the box plots
cc_customer_df_z_H_C["CLUSTERS"] = clusters_cmplt
# print (cc_customer_df_z)
# Let's plot the box plots for the scaled df using the CLUSTERS column.
# cc_customer_df_z_H_C.boxplot(by='CLUSTERS', layout=(2,4), figsize=(15,10)) # Somehow this gives an error on my computer, so using an alternate way.
df_columns = cc_customer_df_z_H_C.columns[:-1]
# print(df_columns)
for col in df_columns:
    # cc_customer_df_z_H_C.boxplot(by='CLUSTERS', column=col, figsize=(5,5))
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z_H_C['CLUSTERS'], y=cc_customer_df_z_H_C[col])
    plt.show()
# Let's reset the dataframe, because the CLUSTERS column added above would otherwise affect the following steps
cc_customer_df_z_H_C = cc_customer_df_z_H_C.drop('CLUSTERS', axis=1)
print(cc_customer_df_z_H_C.head())
# Let's compute the linkage for method=centroid
ctroid_L = linkage(cc_customer_df_z, metric='euclidean', method='centroid')
c_ctroid, coph_dists_ctroid = cophenet(ctroid_L , pdist(cc_customer_df_z))
c_ctroid
# Let's plot dendrogram for centroid Linkage
plt.figure(figsize=(20, 5))
plt.title('Centroid Linkage method Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(ctroid_L,p=10, leaf_rotation=90.,color_threshold=40,leaf_font_size=8.,truncate_mode='level')
plt.tight_layout()
max_d=4
clusters_ctroid = fcluster(ctroid_L, max_d, criterion='distance')
# clusters_ctroid
# NOTE: Tried max_d values of 3, 4 and 5; max_d=4 gives the best score
silhouette_avg_ctroid = silhouette_score(cc_customer_df_z, clusters_ctroid)
print(silhouette_avg_ctroid)
# Let's add the clusters to the scaled data to analyse the box plots
cc_customer_df_z_H_C["CLUSTERS"] = clusters_ctroid
# print (cc_customer_df_z)
# Let's plot the box plots for the scaled df using the CLUSTERS column.
# cc_customer_df_z_H_C.boxplot(by='CLUSTERS', layout=(2,4), figsize=(15,10)) # Somehow this gives an error on my computer, so using an alternate way.
df_columns = cc_customer_df_z_H_C.columns[:-1]
# print(df_columns)
for col in df_columns:
    # cc_customer_df_z_H_C.boxplot(by='CLUSTERS', column=col, figsize=(5,5))
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z_H_C['CLUSTERS'], y=cc_customer_df_z_H_C[col])
    plt.show()
# Let's reset the dataframe, because the CLUSTERS column added above would otherwise affect the following steps
cc_customer_df_z_H_C = cc_customer_df_z_H_C.drop('CLUSTERS', axis=1)
print(cc_customer_df_z_H_C.head())
# Let's compute the linkage for method=ward
ward_L = linkage(cc_customer_df_z, metric='euclidean', method='ward')
c_ward, coph_dists_ward = cophenet(ward_L , pdist(cc_customer_df_z))
c_ward
# Let's plot dendrogram for ward Linkage
plt.figure(figsize=(20, 5))
plt.title('Ward Linkage method Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(ward_L, p=10, leaf_rotation=90., color_threshold=40,leaf_font_size=8.,truncate_mode='level')
plt.tight_layout()
max_d=4
clusters_ward = fcluster(ward_L, max_d, criterion='distance')
# clusters_ward
# NOTE: Tried max_d values of 3 and 4; max_d=4 gives the best score
silhouette_avg_ward = silhouette_score(cc_customer_df_z, clusters_ward)
print(silhouette_avg_ward)
# Let's add the clusters to the scaled data to analyse the box plots
cc_customer_df_z_H_C["CLUSTERS"] = clusters_ward
# print (cc_customer_df_z)
# Let's plot the box plots for the scaled df using the CLUSTERS column.
# cc_customer_df_z_H_C.boxplot(by='CLUSTERS', layout=(2,4), figsize=(15,10)) # Somehow this gives an error on my computer, so using an alternate way.
df_columns = cc_customer_df_z_H_C.columns[:-1]
# print(df_columns)
for col in df_columns:
    # cc_customer_df_z_H_C.boxplot(by='CLUSTERS', column=col, figsize=(5,5))
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z_H_C['CLUSTERS'], y=cc_customer_df_z_H_C[col])
    plt.show()
** Final Hierarchical clustering Summary **
Looking at the various hierarchical clusterings, the "Average" and "Centroid" methods give the best, and very close, cophenetic coefficients of about 0.89.
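The comparison of linkage methods can also be written as a single loop rather than four separate cells. The sketch below runs on random synthetic points (an assumption for illustration), so its numbers say nothing about the customer data; it only shows the pattern of collecting one cophenetic coefficient per method.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 5))   # synthetic points, not the customer data
Y = pdist(X)                   # pairwise Euclidean distances, computed once

coph = {}
for method in ("average", "complete", "centroid", "ward"):
    Z = linkage(X, method=method, metric="euclidean")
    coph[method], _ = cophenet(Z, Y)

# Linkage whose dendrogram best preserves the original distances
best = max(coph, key=coph.get)
for method, value in coph.items():
    print(method, round(float(value), 4))
print("best:", best)
```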
Compare K-means clusters with Hierarchical clusters
The k-means algorithm is in general very fast.
However, it is not guaranteed to find the "optimal" set of clusters. The results depend on the initial set of centroids, so we run the algorithm several times and select the best result (the one with the smallest overall within-cluster variance).
Also, it is difficult to determine the optimal number of clusters, which has to be specified manually. However, we can again run the algorithm several times with different values of k, and identify from the elbow plot the point at which the intra-cluster distances stop improving significantly.
In hierarchical clustering, the distances between each and every pair of points are calculated. For a large dataset, this can be very slow and require a lot of memory. Therefore, hierarchical clustering is best suited to small datasets.
However, unlike k-means, hierarchical clustering does not require specifying the number of clusters beforehand. We can ask the algorithm to generate the whole tree and then read off different numbers of clusters.
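Both points of the comparison can be sketched on synthetic data (the blobs below are an assumption for illustration): K-means is run once per seed and the run with the smallest inertia (within-cluster sum of squared distances) is kept, while the hierarchical tree is built once and then cut at several cluster counts.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=1.5, random_state=5)

# K-means: one run per seed, keep the run with the smallest inertia
runs = [KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X) for seed in range(10)]
best_run = min(runs, key=lambda m: m.inertia_)
print(round(best_run.inertia_, 2))

# Hierarchical: build the tree once, then read off several cluster counts
Z = linkage(X, method="ward")
cuts = {k: fcluster(Z, k, criterion="maxclust") for k in (2, 3, 4, 5)}
for k, labels in cuts.items():
    print(k, len(np.unique(labels)))
```

Setting n_init to a value greater than 1 makes KMeans perform this restart-and-select step internally, so the explicit loop is only there to make the idea visible.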
Analyse the clusters formed, tell us how one cluster differs from another and answer all the key questions
Key Questions:
Considering the communication preferences, there are 4 segments of customers, i.e.
Considering the credit card segments, there are 4 segments as well, i.e.
Also, these customers show different communication preferences. The communication preferences are mostly consistent. Also, based on the credit limit and the number of cards, it is easy to target customers in a specific income group, with few exceptions.
We don't have spending data or other data available, which limits a detailed analysis of spending behaviour, but:
It is reasonable to assume that customers with fewer credit cards have lower credit limits and will mostly spend less using credit cards, with few exceptions.
It is reasonable to assume that customers with a large credit limit and more credit cards will spend more using credit cards.
The following are the recommendations:
Looking at the data,
The data is distributed into 4 groups.
Customers with a lower credit limit who hold fewer than 3 credit cards prefer calls followed by online visits, so the bank should call them directly to promote credit card purchases. Online advertisements can also be used to target this segment.
Customers with an average credit limit of 35000 who hold 5-6 credit cards prefer bank visits, so the bank should promote credit cards to similar customers when they physically visit the bank.
Customers with an average credit limit of 120K who hold around 9 credit cards prefer online visits for applying for credit cards, so the bank should spend more on online advertisements to attract this specific segment.
The recommended strategy will help the bank increase its credit card sales.